Reproducible reporting

An introduction to Quarto

Division of Pharmacoepidemiology and Pharmacoeconomics
Brigham and Women’s Hospital
Harvard Medical School

August 25, 2024

Problem statement

Wait, but how was that done exactly?

Problem statement (i)

  • More often than not, statistical and computational methods are reported and phrased ambiguously, e.g.,

    “We measured the pre-exposure performance status within 90 days of the index date.”

  • Does the 90-day window include or exclude the index date? What was done if there were multiple performance assessments per patient? …

  • Take a moment to reflect: could you exactly reproduce a study you published 10 years ago based only on the paper’s methods section?
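
To make the ambiguity concrete, here is a minimal sketch in base R of how the two open questions above could be pinned down in code. The data, the column names, the function `get_baseline_status()`, and the rule "use the most recent assessment" are all hypothetical choices made for illustration, not the definition used in any particular study.

```r
# Hypothetical performance status assessments (several per patient)
assessments <- data.frame(
  patient_id  = c(1, 1, 1, 2, 2),
  assess_date = as.Date(c("2020-01-10", "2020-03-20", "2020-04-01",
                          "2019-12-01", "2020-05-15")),
  ecog        = c(1, 2, 0, 1, 3)
)

# Hypothetical index dates (one per patient)
index_dates <- data.frame(
  patient_id = c(1, 2),
  index_date = as.Date(c("2020-04-01", "2020-05-15"))
)

# Assumed rule: most recent assessment within the lookback window,
# with an explicit switch for including the index date itself
get_baseline_status <- function(assessments, index_dates,
                                lookback_days = 90,
                                include_index_date = FALSE) {
  dat <- merge(assessments, index_dates, by = "patient_id")
  dat$days_before_index <- as.integer(dat$index_date - dat$assess_date)

  # Day 0 is the index date; exclude it unless explicitly requested
  min_days <- if (include_index_date) 0L else 1L
  in_window <- dat$days_before_index >= min_days &
    dat$days_before_index <= lookback_days
  eligible <- dat[in_window, ]

  # Keep the assessment closest to the index date for each patient
  eligible <- eligible[order(eligible$patient_id, eligible$days_before_index), ]
  most_recent <- eligible[!duplicated(eligible$patient_id),
                          c("patient_id", "ecog")]

  # Patients without an eligible assessment are kept with a missing value
  merge(index_dates, most_recent, by = "patient_id", all.x = TRUE)
}

res_excl <- get_baseline_status(assessments, index_dates, include_index_date = FALSE)
res_incl <- get_baseline_status(assessments, index_dates, include_index_date = TRUE)
print(res_excl)
print(res_incl)
```

In this toy data the two readings of the same sentence give different baseline values for both patients: excluding the index date yields a status of 2 for patient 1 and a missing value for patient 2, whereas including it yields 0 and 3.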

Problem statement (ii)

One could find the details in the analytical programming code, BUT…

Is there a reproducibility crisis?

Nature survey 2016: More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments (Baker 2016)

What if…

If there was just a way to combine…

  • the narrative prose that explains the methods used

  • the analytic code we implemented to execute these methods

  • the corresponding results

…all in one report?

Literate programming

Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do (Donald Knuth, Turing Award recipient)

Definition

It is basically an annotated, executable manuscript!

Literate programming

Programming paradigm introduced in 1984 by Donald Knuth in which a computer program is given as an explanation of how it works in a natural language, such as English, interspersed (embedded) with snippets of macros and traditional source code, from which compilable source code can be generated.

(Knuth 1984)

In other words…

\[ \text{Literate programming} = \text{Documentation + Source Code + Output/Results} \]

Example

Methods section text:

“A propensity score model for exposure initiation was fit using logistic regression with age, sex and smoking as covariates. Patients were matched using nearest neighbor matching on the propensity score in a 1:1 ratio without replacement targeting the average treatment effect among the treated (ATT).”

MatchIt::matchit(
  formula = exposure ~ age_num + female_cat + smoking_cat,
  data = smdi::smdi_data,
  ratio = 1,
  method = "nearest",
  distance = "glm",
  link = "logit",
  estimand = "ATT",
  replace = FALSE
  )
A matchit object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score
             - estimated with logistic regression
 - number of obs.: 2500 (original), 1996 (matched)
 - target estimand: ATT
 - covariates: age_num, female_cat, smoking_cat

History of literate programming

  • Literate programming is a concept pioneered by Donald Knuth, a Turing Award recipient known for creating TeX.

  • The main idea behind the early form of literate programming was to upend the traditional programming practices of the time by systematically including human readable text accompanying and explaining the logic and the purpose of a program.

  • As he describes in “Literate Programming”, Knuth considers the programmer as an “essayist” who should strive to communicate the purpose of a program in order to create better code.

  • While initially centered in the domain of computer science, literate programming has more recently resurged in the interdisciplinary world of “data science”.

https://bernhardbieri.ch/blog/2022-08-25-litteralprogramminginstata/

Quarto

Dynamic study reporting

Introduction to Quarto

  • An open-source scientific and technical publishing system

Side-by-side example of a Quarto document (left) and the rendered .html output (right)

Introduction to Quarto

  • Unifies the functionality of many tools, packages and open source platforms into a single consistent system

  • Extends it with native support for a large number of open-source programming languages (R, Python, Julia, Stan, C++, etc.)

  • Can be used with most common code editors (RStudio, Jupyter, VSCode, etc.)

  • Proprietary programming languages (SAS, Stata) can also be integrated but require some additional setup

    • Additional resources for using Quarto with SAS (setup, demo) and Stata can be found on the course website

Goal: single source publishing

  • Since Quarto is a single source reporting system, we are not constrained to a single output type: the same source document can be rendered to multiple formats

  • Example: a manuscript written for a journal can also be rendered as a website

From raw code and text to an elegant research report using Quarto

Quarto - ingredients for a research report (i)

  • First, we provide metadata about the project in a so-called YAML header (YAML is a recursive acronym for “YAML Ain’t Markup Language”)
  • Documentation on all YAML options can be found on https://quarto.org/docs/reference/
---
title: "My RWE study report"
author: "Janick Weberpals"
date: last-modified
toc: true
code-fold: true
number-sections: true
bibliography: references.bib
csl: pharmacoepidemiology-and-drug-safety.csl
format: 
  html: default
  docx: default
  pdf: default
---

Quarto - ingredients for a research report (ii)

Plain text

  • To describe objectives, methods (just like in a .docx document)

  • Achieved via Markdown syntax

  • ✅ Universal and reproducible formatting across output document types

  • ❌ Syntax needs to be known (although many modern editors come with a GUI)

Quarto - ingredients for a research report (iii)

  • Code chunks make it possible to blend plain text, programming code and the corresponding output, e.g.,

    Methods section text:

    “The propensity score was defined as the probability of each patient to initiate the exposure based on observed baseline covariates including age, sex and smoking.”

Code chunk following description of propensity score model:

# define the propensity score model formula
ps_fit <- as.formula(exposure ~ age_num + female_cat + smoking_cat)
ps_fit
exposure ~ age_num + female_cat + smoking_cat
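
As a minimal sketch of how such a formula turns into the propensity score described in the methods text, the following base R example fits a logistic regression and extracts each patient's predicted probability of exposure. The simulated cohort, its exposure mechanism, and the object names `cohort`, `ps_formula` and `ps_model` are assumptions made purely for illustration; only the covariate names are taken from the example above.

```r
set.seed(42)

# Simulated cohort (hypothetical data, for illustration only)
n <- 500
cohort <- data.frame(
  age_num     = rnorm(n, mean = 60, sd = 10),
  female_cat  = rbinom(n, size = 1, prob = 0.5),
  smoking_cat = rbinom(n, size = 1, prob = 0.3)
)

# Assumed exposure mechanism: older patients and smokers initiate more often
lin_pred <- -0.5 + 0.02 * (cohort$age_num - 60) + 0.4 * cohort$smoking_cat
cohort$exposure <- rbinom(n, size = 1, prob = plogis(lin_pred))

# Propensity score model: logistic regression of exposure on baseline covariates
ps_formula <- exposure ~ age_num + female_cat + smoking_cat
ps_model <- glm(ps_formula, data = cohort, family = binomial(link = "logit"))

# Propensity score = predicted probability of initiating the exposure
cohort$ps <- predict(ps_model, type = "response")
summary(cohort$ps)
```

Placed in a code chunk directly below the methods sentence, the model specification and the text it implements stay side by side.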

Quarto - ingredients for a research report (iv)

  • Code chunks are not limited to one programming language, but can accommodate multiple in the same document
  • Language is chosen by the {} parameter at the beginning of each code chunk
```{r}
print("This is R code")
```
[1] "This is R code"
```{python}
print("This is python code")
```
This is python code

Reporting elements

Figures

Quarto enables the integration, labeling and cross-referencing of figures: writing @fig-missingness in the text renders as Figure 1.

```{r}
#| label: fig-missingness
#| fig-cap: "Proportion of missingness among covariates with at least one unobserved value"

library(smdi)

smdi_vis(smdi_data)
```

Figure 1: Proportion of missingness among covariates with at least one unobserved value

Flowcharts and diagrams

  • Quarto has native support for embedding Mermaid and Graphviz diagrams

  • This enables the depiction of flowcharts, sequence diagrams, state diagrams, Gantt charts, and more using a plain text syntax inspired by Markdown

```{mermaid}
flowchart LR
  C(Confounder) --> E(Exposure)
  C(Confounder) --> O(Outcome)
  E(Exposure) --> O(Outcome)
```

Tables

Similarly, we can create, label and cross-reference tables: writing @tbl-table1 in the text renders as Table 1.

```{r}
#| label: tbl-table1
#| tbl-cap: "Baseline patient characteristics."

library(gtsummary)

trial |>  
  tbl_summary(by = trt, include = c(age, grade))
```
Table 1: Baseline patient characteristics.

| Characteristic | Drug A, N = 98 | Drug B, N = 102 |
|---|---|---|
| Age, Median (IQR) | 46 (37 – 60) | 48 (39 – 56) |
| Unknown | 7 | 4 |
| Grade, n (%) | | |
| I | 35 (36) | 33 (32) |
| II | 32 (33) | 36 (35) |
| III | 31 (32) | 33 (32) |

Equations

  • Quarto also supports LaTeX syntax for typesetting equations, such that …
$$\lambda(t | X) = \lambda_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)$$

… becomes:

\[ \lambda(t | X) = \lambda_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p) \]

Referencing

Quarto has an in-built referencing system for which the referencing style can be chosen in the YAML header

RStudio built-in reference manager
---
title: "My RWE study report"
bibliography: references.bib
csl: pharmacoepidemiology-and-drug-safety.csl
format: docx
---

In the 2009 publication, Schneeweiss et al. 
[@schneeweiss2009high] introduced the 
concept of high-dimensional propensity scores.

⬇️

“In the 2009 publication, Schneeweiss et al. (Schneeweiss et al. 2009) introduced the concept of high-dimensional propensity scores.”

Inline code

  • Inline code lets us execute code within Markdown text, e.g., to automatically insert the most up-to-date results into the narrative.
```{r}
# we determine the sample size in a code chunk to integrate into our narrative report
sample_size <- nrow(smdi_data)
```

The sample size comprised `{r} sample_size` patients.
  • In a dynamic study report, we no longer need to manually copy-paste numbers; we only reference the object computed in the code chunk, which then automatically shows the most up-to-date number:

The sample size comprised 2500 patients.
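
Inline values often need light formatting before they read well in a sentence. The sketch below shows one way to prepare such values in a code chunk; the counts are hard-coded from the outputs shown earlier in this deck (2500 patients, 1996 of them in the matched sample), whereas a real report would derive them from the analysis objects. The object names are hypothetical.

```r
# Counts taken from the outputs shown earlier; in a real report these
# would be computed from the data and the matched object
sample_size <- 2500L
n_matched   <- 1996L

# Format counts with a thousands separator and compute the percentage
sample_size_fmt <- format(sample_size, big.mark = ",")
n_matched_fmt   <- format(n_matched, big.mark = ",")
pct_matched     <- 100 * n_matched / sample_size

# A complete sentence that could be referenced inline
matched_sentence <- sprintf(
  "Of %s patients, %s (%.1f%%) were retained after matching.",
  sample_size_fmt, n_matched_fmt, pct_matched
)
matched_sentence
```

The resulting object could then be referenced in the narrative as `{r} matched_sentence`, so the sentence updates whenever the underlying data change.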

Interactive reports

If we render study reports to output formats that support interactive elements (e.g., .html), Quarto provides even more tools to make study reporting interactive

Tabsets

Tabsets allow us to divide content into multiple tabs for interactive exploration


::: panel-tabset

## Age distribution


## Biomarker distribution

:::

Figure 2: Age by exposure status

Figure 3: Biomarker expression by exposure status

Code folding

Sometimes we want to hide the code by default so that readers can focus on the text and results, while still being able to expand and inspect the code when needed

#| code-fold: true

library(ggplot2)

dataset |> 
  ggplot(aes(x = exposure, y = age_num, fill = exposure)) +
  geom_boxplot(alpha = 0.5) +
  theme_minimal() +
  labs(
    x = "Exposure status",
    y = "Years of age",
    fill = "Exposure"
  )

Figure 4: Age by exposure status

Parameterized reporting (i)

  • RWE studies usually include many sensitivity analyses to check the robustness of certain assumptions and models in the main analysis

  • Often it is just one or a few parameters that have to be changed

  • Copy-pasting code back and forth is very error-prone and should be avoided

Case study

Let’s say we do propensity score matching with a certain caliper and want to run a sensitivity analysis with a different caliper but we don’t want to copy-paste any of the code

Parameterized reporting (ii)

  • The YAML header that contains the metadata about the study report also has an option to define study parameters that can be flexibly changed

  • Let’s say we do propensity score matching with a certain caliper and want to run a sensitivity analysis with a different caliper

---
title: "My RWE study report"
params:
  ps_caliper: 0.05
---
  • The actual caliper is now replaced with params$ps_caliper
MatchIt::matchit(
  formula = exposure ~ age_num + female_cat + smoking_cat,
  data = smdi::smdi_data,
  caliper = params$ps_caliper
  )
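
To run the sensitivity analyses without touching the report itself, the same document can be rendered once per parameter value. The following is a minimal sketch assuming the `quarto` R package is installed and that the report is saved as `report.qmd`; the file name, the caliper values, and the output file naming scheme are hypothetical.

```r
# Caliper values for the main and sensitivity analyses (hypothetical choices)
calipers <- c(0.05, 0.1, 0.2)

# One render job per caliper: same source document, different parameter value
render_jobs <- lapply(calipers, function(cal) {
  list(
    input          = "report.qmd",  # hypothetical file name
    output_format  = "html",
    output_file    = sprintf("report_caliper_%s.html", format(cal)),
    execute_params = list(ps_caliper = cal)
  )
})

# Render only if the quarto R package and the report file are available
if (requireNamespace("quarto", quietly = TRUE) && file.exists("report.qmd")) {
  for (job in render_jobs) {
    do.call(quarto::quarto_render, job)
  }
}
```

Each rendered report then documents which caliper it used, because `params$ps_caliper` is resolved at render time rather than edited by hand.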

Quarto-based research reports

An example of a Quarto-based research report can be found on the course materials website

Quarto summary

Quarto …

  • Is a technical publishing system compatible with most programming languages and editors

  • Is a single source reporting system that can produce many different types of outputs (.docx, .pdf, .html, websites, presentations, etc.)

  • Main ingredients: YAML header (metadata), text, code chunks

  • Can be used to blend narrative text, programming code and output in one document

  • Has all capabilities of common reporting systems (e.g., MS Word) and many that go beyond (inline coding, dynamic and interactive elements, …)

  • May be useful to parameterize analysis pipelines avoiding error-prone copy-pasting

Further resources

References

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604): 452–54. https://doi.org/10.1038/533452a.
Knuth, Donald Ervin. 1984. “Literate Programming.” The Computer Journal 27 (2): 97–111.
Schneeweiss, Sebastian, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. 2009. “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data.” Epidemiology 20 (4): 512–22.